{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Merging data produced by three different people to create the manual_title_subset\n",
    "\n",
    "This is the notebook that created List # 4 in the report, the \"manually-checked title subset.\" It can be re-run to recreate the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "from collections import Counter\n",
    "from difflib import SequenceMatcher\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data, confirm basic format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1200, 18)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "j = pd.read_csv('titledata/jessica.csv')\n",
    "j.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1200, 18)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = pd.read_csv('titledata/patrick.csv')\n",
    "p.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(330, 18)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = pd.read_csv('titledata/ted.tsv', sep = '\\t')\n",
    "t.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(p.columns) == set(j.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(p.columns) == set(t.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concatenate three datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all = pd.concat([j, p, t], sort = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confirm that values match our data dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{nan, 'o', 'f', 'm', 'u'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(all.gender)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, there is no difference between 'u' and nan; they both mean we don't know. The only difference between 'o' and 'u,' in this data, is that one coder has used 'o' in five cases of multiple authorship; the other readers have not done the same thing, so we can't consistently maintain that distinction as part of the dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all.loc[all.gender == 'o', 'gender'] = 'u'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all['gender'] = all['gender'].fillna(value = 'u')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's one row where a coder has double-listed nationality to address dual authorship. This is not something we've done consistently, so let's flatten it out to a single value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all.loc[all.nationality == 'us| us', 'nationality'] = 'us'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "t['gender'] = t['gender'].fillna(value = 'u')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### confirm the ```category``` field\n",
    "\n",
    "This is a basic element of manual correction, indicating basically what genre a book should be filed under.\n",
    "\n",
    "According to our data dictionary in process.md https://github.com/tedunderwood/meta2018/blob/master/process.md the allowable codes here are\n",
    "\n",
    "    nonfic\n",
    "    reprint\n",
    "    novel\n",
    "    poetry\n",
    "    shortstories\n",
    "    juvenile\n",
    "\n",
    "Some of these are designations of a genre, form, or (in the case of \"juvenile\") audience. We don't postulate actually-crisp boundaries between these categories; for one thing, we're characterizing books, and books often include works from *multiple* genres. But I asked my collaborators to produce best-available generalizations for the purpose of a rough survey. Our goal here is not to study anything in detail, but to start by locating works that most people would consider fiction.\n",
    "\n",
    "The \"reprint\" category is anomalous: it's not a genre or form. These are genuine works of fiction, but their first appearance in Hathi is more than 25 years after their original date of publication. We will correct the \"firstpub\" date, but researchers still may wish to exclude these from a random sample based on \"firstpub,\" since some of these works didn't actually circulate widely near first date of publication.\n",
    "\n",
    "In subsequent conversation we added\n",
    "\n",
    "    drama, and\n",
    "    shortstories|juvenile\n",
    "\n",
    "But I am on reflection going to simplify that last category by letting \"juvenile\" be the dominant tag. This project has been targeted from the beginning at adult fiction, and formal divisions within the juvenile category are not something we can really pretend to have addressed.\n",
    "\n",
    "In reality, we recorded a more complex set of categories, because coders didn't always know which category to treat as dominant:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'drama',\n",
       " 'juvenile',\n",
       " 'juvenile | shortstories',\n",
       " 'juvenile|novel',\n",
       " 'juvenile|shortstories',\n",
       " 'nonfic',\n",
       " 'nonfic | reprint',\n",
       " 'nonfic|juvenile',\n",
       " 'nonfic|poetry',\n",
       " 'novel',\n",
       " 'novel|juvenile',\n",
       " 'poetry',\n",
       " 'reprint',\n",
       " 'shortstories',\n",
       " 'shortstories | poetry',\n",
       " 'shortstories|juvenile',\n",
       " 'shortstories|poetry'}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(all.category)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**solution**\n",
    "\n",
    "In this instance, I'm going to impose a strict order of dominance on the categories so we don't have to use pipes and can have a one-to-one mapping here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# We start by cleaning up spaces. Then, we simplify the data\n",
    "# by allowing nonfic and poetry to dominate all categories.\n",
    "# and juvenile to dominate everything that remains.\n",
    "\n",
    "def dominant_category(astring):\n",
    "    ''' Accepts a category string that may contain multiple\n",
    "    \n",
    "    '''\n",
    "    astring = astring.replace(' ', '')\n",
    "    cats = astring.split('|')\n",
    "    if 'nonfic' in cats:\n",
    "        return 'nonfic'\n",
    "    if 'poetry' in cats:\n",
    "        return 'poetry'\n",
    "    if 'drama' in cats:\n",
    "        return 'drama'\n",
    "    if 'reprint' in cats:\n",
    "        return 'reprint'\n",
    "    if 'juvenile' in cats:\n",
    "        return 'juvenile'\n",
    "    if 'shortstories' in cats:\n",
    "        return 'shortstories'\n",
    "    if 'novel' in cats:\n",
    "        return 'novel'\n",
    "    else:\n",
    "        return 'error'\n",
    "\n",
    "all = all.assign(category = all.category.map(dominant_category))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**what do we actually have?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "poetry 11\n",
      "nonfic 199\n",
      "juvenile 144\n",
      "drama 3\n",
      "reprint 129\n",
      "shortstories 298\n",
      "novel 1946\n"
     ]
    }
   ],
   "source": [
    "allcats = set(all.category)\n",
    "for c in allcats:\n",
    "    print(c, sum(all.category == c))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So in a random sample of 2730 books, 2517 are fiction. But some researchers may want to exclude 144 vols of juvenile fiction. \n",
    "\n",
    "A sample focused on literary *production* might also want to exclude the 129 \"reprints\"; they are first appearing in Hathi significantly (>25 yrs) after their original publication. We have now provided correct first publication dates for these (where we can). But if you're trying to reflect relatively *prominent* works from e.g. the 1820s, it might be misleading to include these \"reprints\" as if they had been randomly sampled *from the 1820s* when their period of popularity may actually be later.\n",
    "\n",
    "### Filling out ```firstpub``` column\n",
    "\n",
    "We have manually entered a first publication date where it's earlier than the automatically-calculated \"latest possible date of composition\" (which intersects attested publication date with e.g. author's date of death).\n",
    "\n",
    "But the column is often left blank in our manual process. This turns it into a column that always holds the earlier of the two dates, or just latestcomp if that's all we have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def lowestof(row):\n",
    "    if pd.isnull(row['firstpub']):\n",
    "        return int(row['latestcomp'])\n",
    "    else:\n",
    "        latest = int(row['latestcomp'])\n",
    "        first = int(row['firstpub'])\n",
    "        lowest = min(latest, first)\n",
    "        if lowest > 1790:\n",
    "            return lowest\n",
    "        else:\n",
    "            return latest\n",
    "\n",
    "all = all.assign(firstpub = all.apply(lowestof, axis = 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Breaking the ```reprint``` category out as a distinct column\n",
    "\n",
    "On reflection, it's problematic to include \"reprint\" as a category on the same level as, say \"shortstories.\" Doing that would mean that people who want to include reprints lose any guidance about genre. Let's fix that with some further manual coding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "reprints = pd.read_csv('titledata/reprints.tsv', index_col = 'docid', sep = '\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def new_category_and_repval(row):\n",
    "    global reprints\n",
    "    \n",
    "    if row['category'] == 'reprint':\n",
    "        docid = row.docid\n",
    "        newcat = reprints.loc[docid, 'category']\n",
    "        firstpub = int(reprints.loc[docid, 'firstpub'])\n",
    "    else:\n",
    "        newcat = row['category']\n",
    "        firstpub = int(row['firstpub'])\n",
    "    \n",
    "    foundat = int(row['inferreddate'])\n",
    "    if firstpub + 25 < foundat:\n",
    "        repval = 'reprint'\n",
    "    else:\n",
    "        repval = 'contemporary'\n",
    "    \n",
    "    return newcat, repval, firstpub\n",
    "    \n",
    "categories, repvals, firstpubs = zip(*all.apply(new_category_and_repval, axis = 1))\n",
    "all = all.assign(category = categories)\n",
    "all = all.assign(hathiadvent = repvals)\n",
    "all = all.assign(firstpub = firstpubs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Resetting certain text columns to original values\n",
    "\n",
    "In creating the data we worked on, I truncated certain columns to a character limit, in order to make the spreadsheet more manageable. Also, although we tried to work in utf-8, there were slip-ups that caused certain portions of the data to be saved in a different encoding. Once that happens, special characters are lost.\n",
    "\n",
    "We can reset those columns using the index, ```docid.``` However, we need to be cautious in certain cases, since there are also manual edits to titles we want to preserve.\n",
    "\n",
    "Doing both things at once becomes a fun algorithmic challenge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titlemeta = pd.read_csv('../titlemeta.tsv', \n",
    "                        index_col = 'docid',\n",
    "                       sep = '\\t', low_memory = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Old preferred because much longer:  0\n",
      "Old preferred because it was close & weird chars in new:  28\n",
      "New preferred:  45\n"
     ]
    }
   ],
   "source": [
    "muchshorter = 0\n",
    "rejected_old_strings = []\n",
    "oldbetter = 0\n",
    "\n",
    "def match_values(row, column_name):\n",
    "    global muchshorter, titlemeta, weirdchars, oldbetter, maybe\n",
    "    \n",
    "    tocorrect = {'The manuscripts of Erdély': 'The Manuscripts of Erdély',\n",
    "                'NhÃ¡Ì‚t Háº¡nh, ThÃ\\xadch': 'Nhá̂t Hạnh, Thích',\n",
    "                 'RÄ\\uf181javaá¹ƒÅ›Ä«, Lakshmaá¹‡a': 'Rājavaṃśī, Lakshmaṇa'}\n",
    "    \n",
    "    docid = row['docid']\n",
    "    if pd.isnull(row[column_name]):\n",
    "        newval = \"\"\n",
    "    else:\n",
    "        newval = row[column_name]\n",
    "        \n",
    "    if pd.isnull(titlemeta.loc[docid, column_name]):\n",
    "        oldval = \"\"\n",
    "    else:\n",
    "        oldval = titlemeta.loc[docid, column_name]\n",
    "        \n",
    "    if newval == oldval:\n",
    "        return newval\n",
    "    elif newval in tocorrect:\n",
    "        return tocorrect[newval]\n",
    "    elif len(oldval) < 1:\n",
    "        return newval\n",
    "    elif len(newval) == 25 and len(oldval) > 25 and newval == oldval[0:25]:\n",
    "        muchshorter += 1\n",
    "        return oldval\n",
    "    else:\n",
    "        matcher = SequenceMatcher(None, oldval, newval)\n",
    "        matchprob = matcher.ratio()\n",
    "        \n",
    "        for char in newval:\n",
    "            if char == 'Ä' or char == '�' or char == 'Ã' or char == 'Å':\n",
    "                matchprob += 0.07\n",
    "            elif char == '?':\n",
    "                matchprob += 0.03\n",
    "                \n",
    "        if matchprob > 0.85:\n",
    "            oldbetter += 1\n",
    "            return oldval\n",
    "        else:\n",
    "            rejected_old_strings.append((oldval, newval))\n",
    "            return newval\n",
    "        \n",
    "besttitles = all.apply(match_values, axis = 1, args = ['shorttitle'])\n",
    "print(\"Old preferred because much longer: \", muchshorter)\n",
    "print(\"Old preferred because it was close & weird chars in new: \", oldbetter) \n",
    "print('New preferred: ', len(rejected_old_strings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Old preferred because much longer:  0\n",
      "Old preferred because it was close & weird chars in new:  74\n",
      "New preferred:  14\n"
     ]
    }
   ],
   "source": [
    "muchshorter = 0\n",
    "rejected_old_strings = []\n",
    "oldbetter = 0\n",
    "bestauthors = all.apply(match_values, axis = 1, args = ['author'])\n",
    "print(\"Old preferred because much longer: \", muchshorter)\n",
    "print(\"Old preferred because it was close & weird chars in new: \", oldbetter) \n",
    "print('New preferred: ', len(rejected_old_strings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Old preferred because much longer:  2109\n",
      "Old preferred because it was close & weird chars in new:  1\n",
      "New preferred:  4\n"
     ]
    }
   ],
   "source": [
    "muchshorter = 0\n",
    "rejected_old_strings = []\n",
    "oldbetter = 0\n",
    "bestimprints = all.apply(match_values, axis = 1, args = ['imprint'])\n",
    "print(\"Old preferred because much longer: \", muchshorter)\n",
    "print(\"Old preferred because it was close & weird chars in new: \", oldbetter) \n",
    "print('New preferred: ', len(rejected_old_strings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Old preferred because much longer:  142\n",
      "Old preferred because it was close & weird chars in new:  0\n",
      "New preferred:  2\n"
     ]
    }
   ],
   "source": [
    "muchshorter = 0\n",
    "rejected_old_strings = []\n",
    "oldbetter = 0\n",
    "bestgenres = all.apply(match_values, axis = 1, args = ['genres'])\n",
    "print(\"Old preferred because much longer: \", muchshorter)\n",
    "print(\"Old preferred because it was close & weird chars in new: \", oldbetter) \n",
    "print('New preferred: ', len(rejected_old_strings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Old preferred because much longer:  630\n",
      "Old preferred because it was close & weird chars in new:  3\n",
      "New preferred:  7\n"
     ]
    }
   ],
   "source": [
    "muchshorter = 0\n",
    "rejected_old_strings = []\n",
    "oldbetter = 0\n",
    "bestsubjects = all.apply(match_values, axis = 1, args = ['subjects'])\n",
    "print(\"Old preferred because much longer: \", muchshorter)\n",
    "print(\"Old preferred because it was close & weird chars in new: \", oldbetter) \n",
    "print('New preferred: ', len(rejected_old_strings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all = all.assign(shorttitle = besttitles,\n",
    "                author = bestauthors,\n",
    "                imprint = bestimprints,\n",
    "                genres = bestgenres,\n",
    "                subjects = bestsubjects)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Redefining categories to avoid misunderstanding\n",
    "\n",
    "In manual coding we used the terms \"novel\" and \"shortstories.\" But these phrases are in reality often misleading. Folktales or anecdotes are not really short stories, and some older or experimental fiction might not be quite \"a novel.\"\n",
    "\n",
    "Let's use looser terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def remap_categories(cat):\n",
    "    if cat == 'novel':\n",
    "        return 'longfiction'\n",
    "    elif cat == 'shortstories':\n",
    "        return 'shortfiction'\n",
    "    elif cat == 'nonfic':\n",
    "        return 'notfiction'\n",
    "    elif cat == 'poetry':\n",
    "        return cat\n",
    "    elif cat == 'drama':\n",
    "        return cat\n",
    "    elif cat == 'juvenile':\n",
    "        return cat\n",
    "    else:\n",
    "        print(cat)\n",
    "        return cat\n",
    "\n",
    "all = all.assign(category = all.category.map(remap_categories))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A few minor name corrections\n",
    "\n",
    "Still fixing errors caused by some data having been saved outside utf-8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>realname</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>error</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Aminoff. Constance L�onie Caroline</th>\n",
       "      <td>Aminoff. Constance Léonie Caroline</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wibberley, Leonard Patrick O�Connor</th>\n",
       "      <td>Wibberley, Leonard Patrick O'Connor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B�lte, Amely</th>\n",
       "      <td>Bolte, Amely</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>De la Motte, Friedrich Heinrich Karl, Baron Fouqu�</th>\n",
       "      <td>De la Motte, Friedrich Heinrich Karl, Baron Fo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bouv�, Pauline Carrington Rust</th>\n",
       "      <td>Bouvé, Pauline Carrington Rust</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                             realname\n",
       "error                                                                                                \n",
       "Aminoff. Constance L�onie Caroline                                 Aminoff. Constance Léonie Caroline\n",
       "Wibberley, Leonard Patrick O�Connor                               Wibberley, Leonard Patrick O'Connor\n",
       "B�lte, Amely                                                                             Bolte, Amely\n",
       "De la Motte, Friedrich Heinrich Karl, Baron Fouqu�  De la Motte, Friedrich Heinrich Karl, Baron Fo...\n",
       "Bouv�, Pauline Carrington Rust                                         Bouvé, Pauline Carrington Rust"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corrector = pd.read_csv('titledata/name_corrections.tsv', sep = '\\t', index_col = 'error')\n",
    "corrector.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def correct_names(name):\n",
    "    global corrector\n",
    "    if name in corrector.index:\n",
    "        return corrector.loc[name, 'realname']\n",
    "    else:\n",
    "        return name\n",
    "\n",
    "all = all.assign(realname = all.realname.map(correct_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all.to_csv('manual_title_subset.tsv', index = False, sep = '\\t')"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
